{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3fea0e5f",
   "metadata": {},
   "source": [
    "## 昇思+昇腾开发板：软硬结合玩转大模型实践能力认证（初级）\n",
    "\n",
    "**环境准备：**\n",
    "\n",
    "开发者拿到香橙派开发板后，首先需要进行硬件资源确认、镜像烧录以及CANN和MindSpore版本的升级，才可运行该案例，具体如下：\n",
    "\n",
    "|**香橙派AIpro**|**镜像**|**CANN Toolkit/Kernels**|**MindSpore**|**MindSpore NLP**|\n",
    "|:-------:|:-------:|:-------:|:-------:|:-------:|\n",
    "|20T 24G|Ubuntu|8.0.0beta1|2.5.0|0.4分支|\n",
    "\n",
    "- CANN检查与升级：参考[链接](https://www.mindspore.cn/tutorials/zh-CN/r2.6.0/orange_pi/environment_setup.html#3-cann%E5%8D%87%E7%BA%A7)\n",
    "- MindSpore检查与升级：参考[链接](https://www.mindspore.cn/tutorials/zh-CN/r2.6.0/orange_pi/environment_setup.html#4-mindspore%E5%8D%87%E7%BA%A7)\n",
    "- MindSpore NLP安装命令：\n",
    "    ```bash\n",
    "    pip install git+https://github.com/mindspore-lab/mindnlp.git@0.4\n",
    "    ```\n",
    "\n",
    "**场景说明：** 在本次实践中，我们将基于MindSpore NLP对DeepSeek-R1-Distill-Qwen-1.5B模型进行LoRA微调并推理，使得模型可以模仿《甄嬛传》中甄嬛的口吻进行对话。其中，LoRA（Low-Rank Adaptation）是一种参数高效微调（Parameter-Efficient Fine-Tuning, PEFT）方法。其核心思想是冻结原始网络参数，对Attention层中QKV等模块添加旁支。旁支包含两个低维度的矩阵A和矩阵B，微调过程中仅更新A、B 矩阵。通过这种方式，显著降低计算和内存成本，同时达到与全参数微调相近的效果。\n",
    "\n",
    "模型链接：https://modelers.cn/models/MindSpore-Lab/DeepSeek-R1-Distill-Qwen-1.5B-FP16\n",
    "\n",
    "**考核目标：** 本次实践旨在考核基于MindSpore NLP对大模型微调推理流程的掌握，共6个考点。考生需补全空缺处的代码，保证执行脚本全流程跑通。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ff41ab0c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import mindspore\n",
    "import mindnlp\n",
    "from mindnlp.transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "from mindnlp.engine import TrainingArguments, Trainer\n",
    "from mindnlp.dataset import load_dataset\n",
    "from mindnlp.transformers import GenerationConfig\n",
    "from mindnlp.peft import LoraConfig, TaskType, get_peft_model, PeftModel\n",
    "\n",
    "from mindnlp.engine.utils import PREFIX_CHECKPOINT_DIR\n",
    "from mindnlp.configs import SAFE_WEIGHTS_NAME\n",
    "from mindnlp.engine.callbacks import TrainerCallback, TrainerState, TrainerControl\n",
    "\n",
    "from mindspore._c_expression import disable_multi_thread\n",
    "disable_multi_thread()\n",
    "\n",
    "# 开启同步，用于定位问题，调试完毕后建议关闭同步\n",
    "# mindspore.set_context(pynative_synchronize=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa0189e1",
   "metadata": {},
   "source": [
    "### 获取并加载数据集\n",
    "\n",
    "本次实践使用了huanhuan数据集，该数据集从《甄嬛传》的剧本进行整理，从原始文本中提取出我们关注的角色的对话，并形成 QA 问答对，最终整理为json格式的数据，数据样本示例如下：\n",
    "\n",
    "```text\n",
    "[\n",
    "    {\n",
    "        \"instruction\": \"小姐，别的秀女都在求中选，唯有咱们小姐想被撂牌子，菩萨一定记得真真儿的——\",\n",
    "        \"input\": \"\",\n",
    "        \"output\": \"嘘——都说许愿说破是不灵的。\"\n",
    "    },\n",
    "]\n",
    "```\n",
    "\n",
    "使用`openmind_hub`接口下载数据集，并通过`load_dataset`加载。\n",
    "\n",
    "数据集链接：https://modelers.cn/datasets/MindSpore-Lab/huanhuan"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a92bdba",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install openmind_hub"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9c792a14",
   "metadata": {},
   "outputs": [],
   "source": [
    "from openmind_hub import om_hub_download\n",
    "\n",
    "# 从魔乐社区下载数据集\n",
    "om_hub_download(\n",
    "    repo_id=\"MindSpore-Lab/huanhuan\",\n",
    "    repo_type=\"dataset\",\n",
    "    filename=\"huanhuan.json\",\n",
    "    local_dir=\"./\",\n",
    ")\n",
    "\n",
    "# 加载数据集\n",
    "dataset = load_dataset(path=\"json\", data_files=\"./huanhuan.json\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe174a91",
   "metadata": {},
   "source": [
    "### 实例化tokenizer\n",
    "\n",
    "创建一个分词器`tokenizer`，并配置其填充标记和填充位置。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8a7f32d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 实例化tokenizer\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"MindSpore-Lab/DeepSeek-R1-Distill-Qwen-1.5B-FP16\", mirror=\"modelers\", use_fast=False, ms_type=mindspore.float16)\n",
    "tokenizer.pad_token = tokenizer.eos_token\n",
    "tokenizer.padding_side = 'right'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f578e41f",
   "metadata": {},
   "source": [
    "### **考点1：数据处理**\n",
    "\n",
    "`process_func`函数将原始的对话数据转换为适合模型微调的格式，后续为节约时间，使用`take`接口对数据集进行裁剪。\n",
    "\n",
    "**要求：** 请将定义好的数据预处理函数应用于数据集`dataset`上，处理后的数据集保存在`formatted_dataset`中，并打印预处理后的数据，期望打印结果如下：\n",
    "```text\n",
    "User: 小姐，别的秀女都在求中选，唯有咱们小姐想被撂牌子，菩萨一定记得真真儿的——\n",
    "\n",
    "Assistant: 嘘——都说许愿说破是不灵的。<｜end▁of▁sentence｜><｜end▁of▁sentence｜><｜end▁of▁sentence｜><｜end▁of▁sentence｜><｜end▁of▁sentence｜><｜end▁of▁sentence｜><｜end▁of▁sentence｜><｜end▁of▁sentence｜><｜end▁of▁sentence｜><｜end▁of▁sentence｜><｜end▁of▁sentence｜><｜end▁of▁sentence｜><｜end▁of▁sentence｜><｜end▁of▁sentence｜><｜end▁of▁sentence｜><｜end▁of▁sentence｜><｜end▁of▁sentence｜>\n",
    "```\n",
    "\n",
    "**参考文档：** 具体实现参考MindSpore官网2.5.0版本学习-教程-快速上手（数据加载与处理）中的数据变换，或者参考[昇思+昇腾开发板：软硬结合玩转DeepSeek开发实战课程](https://www.hiascend.com/developer/courses/detail/1925362775376744449)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "32e576eb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 定义数据处理逻辑\n",
    "def process_func(instruction, input, output):\n",
    "    MAX_SEQ_LENGTH = 64  # 最长序列长度\n",
    "    input_ids, attention_mask, labels = [], [], []\n",
    "    # 首先生成user和assistant的对话模板\n",
    "    # User: instruction + input\n",
    "    # Assistant: output\n",
    "    formatted_instruction = tokenizer(f\"User: {instruction}{input}\\n\\n\", add_special_tokens=False)\n",
    "    formatted_response = tokenizer(f\"Assistant: {output}\", add_special_tokens=False)\n",
    "    # 最后添加 eos token，在deepseek-r1-distill-qwen的词表中， eos_token 和 pad_token 对应同一个token\n",
    "    # User: instruction + input \\n\\n Assistant: output + eos_token\n",
    "    input_ids = formatted_instruction[\"input_ids\"] + formatted_response[\"input_ids\"] + [tokenizer.pad_token_id]\n",
    "    # 注意相应\n",
    "    attention_mask = formatted_instruction[\"attention_mask\"] + formatted_response[\"attention_mask\"] + [1]\n",
    "    labels = [-100] * len(formatted_instruction[\"input_ids\"]) + formatted_response[\"input_ids\"] + [tokenizer.pad_token_id]\n",
    "\n",
    "    if len(input_ids) > MAX_SEQ_LENGTH:\n",
    "        input_ids = input_ids[:MAX_SEQ_LENGTH]\n",
    "        attention_mask = attention_mask[:MAX_SEQ_LENGTH]\n",
    "        labels = labels[:MAX_SEQ_LENGTH]\n",
    "\n",
    "    # 填充到最大长度\n",
    "    padding_length = MAX_SEQ_LENGTH - len(input_ids)\n",
    "    input_ids = input_ids + [tokenizer.pad_token_id] * padding_length\n",
    "    attention_mask = attention_mask + [0] * padding_length  # 填充的 attention_mask 为 0\n",
    "    labels = labels + [-100] * padding_length  # 填充的 label 为 -100\n",
    "    \n",
    "    return input_ids, attention_mask, labels\n",
    "\n",
    "# >>>>>>> 题目：将定义好的预处理函数应用于数据集上 <<<<<<<\n",
    "formatted_dataset = ________\n",
    "\n",
    "# 查看预处理后的数据\n",
    "for input_ids, attention_mask, labels in formatted_dataset.create_tuple_iterator():\n",
    "    print(tokenizer.decode(input_ids))\n",
    "    break\n",
    "\n",
    "# 为节约时间，将数据集裁剪\n",
    "truncated_dataset = formatted_dataset.take(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ebab2f2b",
   "metadata": {},
   "source": [
    "### **考点2：LoRA配置**\n",
    "\n",
    "加载预训练模型权重和生成配置，并通过`LoraConfig`配置LoRA参数。\n",
    "\n",
    "**要求：** 请补齐LoRA配置中的Lora秩和Lora alpha，使得考点3中获取到的参与训练的参数量不超过总参数量的0.52%。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d994f0ae",
   "metadata": {},
   "outputs": [],
   "source": [
    "model_id = \"MindSpore-Lab/DeepSeek-R1-Distill-Qwen-1.5B-FP16\"\n",
    "\n",
    "base_model = AutoModelForCausalLM.from_pretrained(model_id, mirror=\"modelers\", ms_dtype=mindspore.float16)\n",
    "base_model.generation_config = GenerationConfig.from_pretrained(model_id, mirror=\"modelers\")\n",
    "\n",
    "base_model.generation_config.pad_token_id = base_model.generation_config.eos_token_id\n",
    "\n",
    "# >>>>>>> 题目：LoRA配置，补齐Lora秩、Lora alpha <<<<<<<\n",
    "\n",
    "config = LoraConfig(\n",
    "    task_type=TaskType.CAUSAL_LM, \n",
    "    target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\", \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
    "    inference_mode=False, # 训练模式\n",
    "    r=________, # Lora 秩\n",
    "    lora_alpha=________, # Lora alpha，具体作用参见 Lora 原理\n",
    "    lora_dropout=0.1# Dropout 比例\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92a5a520",
   "metadata": {},
   "source": [
    "### **考点3：实例化LoRA模型**\n",
    "\n",
    "实例化LoRA模型，打印训练参数量占比，并定义回调类`SavePeftModelCallback`，保存训练过程中的lora adapter权重。\n",
    "\n",
    "**要求：** 请使用已有的MindSpore NLP接口`get_peft_model`，加载LoRA配置，实例化LoRA模型。\n",
    "\n",
    "**参考文档：** https://github.com/mindspore-lab/mindnlp/blob/0.4/docs/en/tutorials/peft.md"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9b4afda6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# >>>>>>> 题目：实例化LoRA模型 <<<<<<<\n",
    "model = ________\n",
    "\n",
    "# 获取模型训练参数占比数，发现仅占总参数量的0.516%\n",
    "total_params = 0\n",
    "lora_params = 0\n",
    "for param in model.trainable_params():\n",
    "    lora_params += param.size\n",
    "for param in model.get_parameters():\n",
    "    total_params += param.size\n",
    "print('proportion of parameters: ', lora_params / total_params)\n",
    "\n",
    "class SavePeftModelCallback(TrainerCallback):\n",
    "    def on_save(\n",
    "        self,\n",
    "        args: TrainingArguments,\n",
    "        state: TrainerState,\n",
    "        control: TrainerControl,\n",
    "        **kwargs,\n",
    "    ):\n",
    "        checkpoint_folder = os.path.join(\n",
    "            args.output_dir, f\"{PREFIX_CHECKPOINT_DIR}-{state.global_step}\"\n",
    "        )       \n",
    "\n",
    "        # 保存adapter weights\n",
    "        peft_model_path = os.path.join(checkpoint_folder, \"adapter_model\")\n",
    "        # 保存训练过程中的lora adapter权重\n",
    "        kwargs[\"model\"].save_pretrained(peft_model_path, safe_serialization=True)\n",
    "\n",
    "        # 删除base model的saeftensors权重，节约更多空间\n",
    "        base_model_path = os.path.join(checkpoint_folder, SAFE_WEIGHTS_NAME)\n",
    "        os.remove(base_model_path) if os.path.exists(base_model_path) else None\n",
    "\n",
    "        return control\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f44b7f0e",
   "metadata": {},
   "source": [
    "### 启动微调\n",
    "\n",
    "配置微调超参，并执行微调。执行完微调后，可在`./output/DeepSeek-R1-Distill-Qwen-1.5B`中找到`checkpoint-3`的文件夹，内有保存微调后的LoRA adapter权重。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f21e9a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "args = TrainingArguments(\n",
    "    output_dir=\"./output/DeepSeek-R1-Distill-Qwen-1.5B\",\n",
    "    per_device_train_batch_size=1,\n",
    "    logging_steps=1,\n",
    "    num_train_epochs=1,\n",
    "    save_steps=3,\n",
    "    learning_rate=1e-4,\n",
    ")\n",
    "\n",
    "trainer = Trainer(\n",
    "    model=model,\n",
    "    args=args,\n",
    "    train_dataset=truncated_dataset,\n",
    "    callbacks=[SavePeftModelCallback],\n",
    ")\n",
    "\n",
    "trainer.train()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12e85a69",
   "metadata": {},
   "source": [
    "### 模型推理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d83c0075",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install gradio==4.44.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c1db5d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import gradio as gr\n",
    "import mindspore\n",
    "from mindnlp.transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "from mindnlp.transformers import TextIteratorStreamer\n",
    "from threading import Thread"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0418778",
   "metadata": {},
   "source": [
    "### **考点4：模型实例化**\n",
    "\n",
    "加载tokenizer和预训练模型model，并在model的基础上使用PeftModel加载微调后的LoRA adapter权重。\n",
    "\n",
    "**要求：** 请**使用MindSpore NLP的API接口**`AutoModelForCausalLM`和`AutoTokenizer`，实例化`tokenizer`和模型`model`，镜像地址为`modelers`，模型ID为`MindSpore-Lab/DeepSeek-R1-Distill-Qwen-1.5B-FP16`，数据类型为`float16`\n",
    "\n",
    "**参考文档：** https://github.com/mindspore-lab/mindnlp/blob/0.4/docs/en/tutorials/quick_start.md"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3d4474c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# >>>>>>> 题目：实例化tokenizer和模型，镜像地址为modelers，模型ID为\"MindSpore-Lab/DeepSeek-R1-Distill-Qwen-1.5B-FP16\"，数据类型为float16 <<<<<<<\n",
    "# >>>>>>> 补全实例化tokenizer和模型的代码 <<<<<<<\n",
    "tokenizer = ________\n",
    "model = ________\n",
    "model = PeftModel.from_pretrained(model, \"./output/DeepSeek-R1-Distill-Qwen-1.5B/checkpoint-3/adapter_model/\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "908a4787",
   "metadata": {},
   "source": [
    "### **考点5：构建输入对话**\n",
    "\n",
    "`build_input_from_chat_history`函数用于根据聊天历史记录和当前消息构建一个消息列表，将聊天历史和当前消息整合成一个特定格式的列表，以便后续处理或传递给聊天机器人模型。\n",
    "\n",
    "**要求：** 请补齐`build_input_from_chat_history`函数中对话历史记录的输入格式，即字典中`content`对应的内容"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "79a39864",
   "metadata": {},
   "outputs": [],
   "source": [
    "system_prompt = \"You are a helpful and friendly chatbot\"\n",
    "\n",
    "def build_input_from_chat_history(chat_history, msg: str):\n",
    "    messages = [{'role': 'system', 'content': system_prompt}]\n",
    "    for user_msg, ai_msg in chat_history:\n",
    "        # >>>>>>> 题目：补齐对话历史记录的输入格式，即字典中“content”对应的内容 <<<<<<<\n",
    "        messages.append({'role': 'user', 'content': ________})\n",
    "        messages.append({'role': 'assistant', 'content': ________})\n",
    "    messages.append({'role': 'user', 'content': msg})\n",
    "    return messages"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca63ba52",
   "metadata": {},
   "source": [
    "### **考点6：设置generate参数**\n",
    "\n",
    "`predict`函数用于预测生成文本，它接收用户的消息和对话历史，然后生成一个回应，其中`generate_kwargs`包含生成文本所需的参数。\n",
    "\n",
    "**要求：** 请补齐`predict`函数中`generate_kwargs`的参数，包括最大生成长度、top-p采样参数、重复惩罚系数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f54a04c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 预测生成文本\n",
    "def predict(message, history):\n",
    "    history_transformer_format = history + [[message, \"\"]]\n",
    "\n",
    "    # 构建输入消息列表\n",
    "    messages = build_input_from_chat_history(history, message)\n",
    "    input_ids = tokenizer.apply_chat_template(\n",
    "            messages,\n",
    "            add_generation_prompt=True,\n",
    "            return_tensors=\"ms\",\n",
    "            tokenize=True\n",
    "        )\n",
    "    streamer = TextIteratorStreamer(tokenizer, timeout=300, skip_prompt=True, skip_special_tokens=True)\n",
    "    generate_kwargs = dict(\n",
    "        input_ids=input_ids,\n",
    "        streamer=streamer,\n",
    "        # >>>>>>> 题目：设置最大生成长度 <<<<<<<\n",
    "        max_new_tokens=________,\n",
    "        do_sample=True,\n",
    "        # >>>>>>> 题目：设置top-p采样参数 <<<<<<<\n",
    "        top_p=________,\n",
    "        temperature=0.1,\n",
    "        num_beams=1,\n",
    "        # >>>>>>> 题目：设置重复惩罚系数 <<<<<<<\n",
    "        repetition_penalty=________\n",
    "    )\n",
    "    t = Thread(target=model.generate, kwargs=generate_kwargs)\n",
    "    t.start()  # 在单独的线程中执行生成\n",
    "    partial_message = \"\"\n",
    "    for new_token in streamer:\n",
    "        partial_message += new_token\n",
    "        if '</s>' in partial_message:  # 如果出现结束标记，则终止循环\n",
    "            break\n",
    "        yield partial_message\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bd58c2d",
   "metadata": {},
   "source": [
    "### 启动Gradio聊天界面\n",
    "\n",
    "创建一个简单的聊天机器人界面，用于对话问答。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4e908e35",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 设置 Gradio 聊天界面\n",
    "gr.ChatInterface(predict,\n",
    "                 title=\"DeepSeek-R1-Distill-Qwen-1.5B\",\n",
    "                 description=\"问几个问题\",\n",
    "                 examples=['你是谁？', '你能做什么？']\n",
    "                 ).launch()  # 启动 Web 界面"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
